A NOISY SPEECH PARAMETER ENHANCEMENT METHOD AND APPARATUS



TECHNICAL FIELD

The present invention relates to a noisy speech parameter enhancement method and apparatus
that may be used in, for example, noise suppression equipment in telephony systems.

BACKGROUND OF THE INVENTION

A common signal processing problem is the enhancement of a signal from its noisy
measurement. This can for example be enhancement of the speech quality in single
microphone telephony systems, both conventional and cellular, where the speech is degraded
by colored noise, for example car noise in cellular systems.

An often used noise suppression method is based on Kalman filtering, since this method can
handle colored noise and has a reasonable numerical complexity. The key reference for
Kalman filter based noise suppressors is [1]. However, Kalman filtering is a model based
adaptive method, where speech as well as noise are modeled as, for example, autoregressive
(AR) processes. Thus, a key issue in Kalman filtering is that the filtering algorithm relies on
a set of unknown parameters that have to be estimated. The two most important problems
regarding the estimation of the involved parameters are that (i) the speech AR parameters are
estimated from degraded speech data, and (ii) the speech data are not stationary. Thus, in
order to obtain a Kalman filter output with high audible quality, the accuracy and precision
of the estimated parameters are of great importance.

SUMMARY OF THE INVENTION

An object of the present invention is to provide an improved method and apparatus for
estimating parameters of noisy speech. These enhanced speech parameters may be used for
Kalman filtering noisy speech in order to suppress the noise. However, the enhanced speech
parameters may also be used directly as speech parameters in speech encoding.

The above object is solved by a method in accordance with claim 1 and an apparatus in
accordance with claim 11.

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention, together with further objects and advantages thereof, may best be understood
by making reference to the following description taken together with the accompanying
drawings, in which:

Figure 1 is a block diagram of an apparatus in accordance with the present invention;

Figure 2 is a state diagram of a voice activity detector (VAD) used in the apparatus of
figure 1;

Figure 3 is a flow chart illustrating the method in accordance with the present invention;

Figure 4 illustrates the essential features of the power spectral density (PSD) of noisy 
speech; 

Figure 5 illustrates a similar PSD for background noise; 

Figure 6 illustrates the resulting PSD after subtraction of the PSD in figure 5 from the
PSD in figure 4;

Figure 7 illustrates the improvement obtained by the present invention in the form of a loss 
function; and 

Figure 8 illustrates the improvement obtained by the present invention in the form of a loss 
ratio. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In speech signal processing the input speech is often corrupted by background noise. For 
example, in hands-free mobile telephony the speech to background noise ratio may be as low 
as, or even below, 0 dB. Such high noise levels severely degrade the quality of the
conversation, not only due to the high noise level itself, but also due to the audible artifacts
that are generated when noisy speech is encoded and carried through a digital communication
channel. In order to reduce such audible artifacts the noisy input speech may be pre-processed
by some noise reduction method, for example by Kalman filtering [1]. 

In some noise reduction methods (for example in Kalman filtering) autoregressive (AR)
parameters are of interest. Thus, accurate AR parameter estimates from noisy speech data are
essential for these methods in order to produce an enhanced speech output with high audible
quality. Such a noisy speech parameter enhancement method will now be described with 
reference to figures 1-6. 

In figure 1 a continuous analog signal x(t) is obtained from a microphone 10. Signal x(t) is
forwarded to an A/D converter 12. This A/D converter (and appropriate data buffering)
produces frames {x(k)} of audio data (containing either speech, background noise or both).
An audio frame may typically contain between 100 and 300 audio samples at 8000 Hz sampling
rate. In order to simplify the following discussion, a frame length N=256 samples is
assumed. The audio frames {x(k)} are forwarded to a voice activity detector (VAD) 14,
which controls a switch 16 for directing audio frames {x(k)} to different blocks in the
apparatus depending on the state of VAD 14.

VAD 14 may be designed in accordance with principles that are discussed in [2], and is 
usually implemented as a state machine. Figure 2 illustrates the possible states of such a state 
machine. In state 0 VAD 14 is idle or "inactive", which implies that audio frames {x(k)} are
not further processed. State 20 implies a noise level and no speech. State 21 implies a noise
level and a low speech/noise ratio. This state is primarily active during transitions between
speech activity and noise. Finally, state 22 implies a noise level and high speech/noise ratio. 
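
The role of the VAD states in steering frames through the apparatus can be sketched as follows. This is only an illustration: the state numbers are those of figure 2, but the class name `VadState`, the function `route_frame` and the returned path labels are hypothetical, and the transition logic of the state machine (discussed in [2]) is not modeled.

```python
from enum import IntEnum


class VadState(IntEnum):
    """States of the VAD state machine, numbered as in figure 2."""
    INACTIVE = 0    # idle: frames are not processed further
    NOISE = 20      # noise level, no speech
    LOW_SNR = 21    # noise level and low speech/noise ratio
    HIGH_SNR = 22   # noise level and high speech/noise ratio


def route_frame(state: VadState) -> str:
    """Return a (hypothetical) label for the path a frame takes."""
    if state == VadState.INACTIVE:
        return "discard"
    if state == VadState.NOISE:
        # Noise frames feed the noise model and are attenuated.
        return "noise_path"
    # States 21 and 22 both indicate speech activity.
    return "speech_path"
```

Treating states 21 and 22 identically follows the description of figure 1, where both are taken to indicate speech.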

An audio frame {x(k)} contains audio samples that may be expressed as

$$x(k) = s(k) + v(k), \qquad k = 1, \ldots, N \tag{1}$$



where x(k) denotes noisy speech samples, s(k) denotes speech samples and v(k) denotes
colored additive background noise. Noisy speech signal x(k) is assumed stationary over a
frame. Furthermore, speech signal s(k) may be described by an autoregressive (AR) model
of order r

$$s(k) = -\sum_{i=1}^{r} c_i\, s(k-i) + w_s(k) \tag{2}$$



where the variance of $w_s(k)$ is given by $\sigma_s^2$. Similarly, v(k) may be described by an AR model
of order q

$$v(k) = -\sum_{i=1}^{q} b_i\, v(k-i) + w_v(k) \tag{3}$$



where the variance of $w_v(k)$ is given by $\sigma_v^2$. Both r and q are much smaller than the frame
length N. Normally, the value of r is preferably around 10, while q preferably has a value
in the interval 0-7, for example 4 (q=0 corresponds to a constant power spectral density,
i.e. white noise). Further information on AR modelling of speech may be found in [3].
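
As an illustration of the signal model (1)-(3), the following sketch synthesizes an AR process with the sign convention used above and adds a speech-like and a noise-like process to form x(k). The function name `ar_synthesize`, the coefficient values and the use of Gaussian driving noise with zero initial conditions are assumptions made for the example only; they are not taken from the description.

```python
import numpy as np


def ar_synthesize(coeffs, noise_var, n_samples, rng):
    """Generate y(k) = -sum_i coeffs[i-1] * y(k-i) + w(k), var(w) = noise_var."""
    coeffs = np.asarray(coeffs, dtype=float)
    order = len(coeffs)
    w = rng.normal(0.0, np.sqrt(noise_var), n_samples)
    y = np.zeros(n_samples + order)      # leading zeros act as initial conditions
    for k in range(n_samples):
        past = y[k:k + order][::-1]      # y(k-1), y(k-2), ..., y(k-order)
        y[k + order] = -np.dot(coeffs, past) + w[k]
    return y[order:]


rng = np.random.default_rng(0)
N = 256                                        # frame length assumed in the text
s = ar_synthesize([-1.2, 0.8], 1.0, N, rng)    # illustrative order-2 "speech" model
v = ar_synthesize([-0.5], 0.5, N, rng)         # illustrative order-1 colored noise
x = s + v                                      # noisy speech frame, equation (1)
```

With an empty coefficient list (q=0) the function simply returns the white driving noise, which matches the remark that q=0 corresponds to white noise.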



Furthermore, the power spectral density $\Phi_x(\omega)$ of noisy speech may be divided into a sum
of the power spectral density $\Phi_s(\omega)$ of speech and the power spectral density $\Phi_v(\omega)$ of
background noise, that is

$$\Phi_x(\omega) = \Phi_s(\omega) + \Phi_v(\omega) \tag{4}$$



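From (2) it follows that

$$\Phi_s(\omega) = \frac{\sigma_s^2}{\left|1 + \sum_{m=1}^{r} c_m e^{-i\omega m}\right|^2} \tag{5}$$

Similarly, from (3) it follows that

$$\Phi_v(\omega) = \frac{\sigma_v^2}{\left|1 + \sum_{m=1}^{q} b_m e^{-i\omega m}\right|^2} \tag{6}$$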

From (2)-(3) it follows that x(k) equals an autoregressive moving average (ARMA) model
with power spectral density $\Phi_x(\omega)$. An estimate of $\Phi_x(\omega)$ (here and in the sequel
estimated quantities are denoted by a hat "^") can be achieved by an autoregressive (AR)
model, that is

$$\hat{\Phi}_x(\omega) = \frac{\hat{\sigma}_x^2}{\left|1 + \sum_{m=1}^{p} \hat{a}_m e^{-i\omega m}\right|^2} \tag{7}$$



where $\{\hat{a}_i\}$ and $\hat{\sigma}_x^2$ are the estimated parameters of the AR model

$$x(k) = -\sum_{i=1}^{p} a_i\, x(k-i) + w_x(k) \tag{8}$$

where the variance of $w_x(k)$ is given by $\sigma_x^2$, and where $r \le p \le N$. It should be noted that
$\hat{\Phi}_x(\omega)$ in (7) is not a statistically consistent estimate of $\Phi_x(\omega)$. In speech signal processing
this is, however, not a serious problem, since x(k) in practice is far from a stationary process.

In figure 1, when VAD 14 indicates speech (states 21 and 22 in figure 2) signal x(k) is
forwarded to a noisy speech AR estimator 18, which estimates parameters $\sigma_x^2$, $\{a_i\}$ in equation
(8). This estimation may be performed in accordance with [3] (in the flow chart of figure 3
this corresponds to step 120). The estimated parameters are forwarded to block 20, which
calculates an estimate of the power spectral density of input signal x(k) in accordance with
equation (7) (step 130 in fig. 3).
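
The description leaves the choice of AR estimator to [3]. One common choice, shown here purely as an assumption, is the autocorrelation (Yule-Walker) method; the function name `estimate_ar` is hypothetical. The sketch uses the sign convention of equation (8), so the normal equations read R a = -r, and the residual variance follows from the lag-zero equation.

```python
import numpy as np


def estimate_ar(frame, order):
    """Autocorrelation-method (Yule-Walker) estimate of AR parameters.

    Sign convention of equation (8): x(k) = -sum_i a_i x(k-i) + w_x(k).
    Returns (a, residual_variance).
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Biased sample autocorrelation r(0), ..., r(order).
    r = np.array([np.dot(x[:n - lag], x[lag:]) / n for lag in range(order + 1)])
    # Toeplitz system R a = -r(1..order), with R[i, j] = r(|i - j|).
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:])
    residual_variance = r[0] + np.dot(a, r[1:])
    return a, residual_variance
```

For a frame of N=256 samples and p around 10-20, the small linear system is solved directly here; a Levinson-Durbin recursion would be the usual lower-cost alternative.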



WO 97/28527 



PCT/SE97/00124 




It is an essential feature of the present invention that background noise may be treated as 
long-time stationary, that is stationary over several frames. Since speech activity is usually 
sufficiently low to permit estimation of the noise model in periods where s(k) is absent, the 
long-time stationarity feature may be used for power spectral density subtraction of noise 
during noisy speech frames by buffering noise model parameters during noise frames for later 
use during noisy speech frames. Thus, when VAD 14 indicates background noise (state 20
in figure 2), the frame is forwarded to a noise AR parameter estimator 22, which estimates
parameters $\sigma_v^2$ and $\{b_i\}$ of the frame (this corresponds to step 140 in the flow chart in figure
3). As mentioned above the estimated parameters are stored in a buffer 24 for later use during 
a noisy speech frame (step 150 in fig. 3). When these parameters are needed (during a noisy 
speech frame) they are retrieved from buffer 24. The parameters are also forwarded to a 
block 26 for power spectral density estimation of the background noise, either during the 
noise frame (step 160 in fig. 3), which means that the estimate has to be buffered for later 
use, or during the next speech frame, which means that only the parameters have to be 
buffered. Thus, during frames containing only background noise the estimated parameters are
not actually used for enhancement purposes. Instead the noise signal is forwarded to
attenuator 28 which attenuates the noise level by, for example, 10 dB (step 170 in fig. 3). 

The power spectral density (PSD) estimate $\hat{\Phi}_x(\omega)$, as defined by equation (7), and the PSD
estimate $\hat{\Phi}_v(\omega)$, as defined by an equation similar to (6) but with "^" signs over the AR
parameters and $\sigma_v^2$, are functions of the frequency $\omega$. The next step is to perform the actual
PSD subtraction, which is done in block 30 (step 180 in fig. 3). In accordance with the
invention the power spectral density of the speech signal is estimated by

$$\hat{\Phi}_s(\omega) = \hat{\Phi}_x(\omega) - \delta\, \hat{\Phi}_v(\omega) \tag{9}$$

where $\delta$ is a scalar design variable, typically lying in the interval $0 < \delta < 4$. In normal cases
$\delta$ has a value around 1 ($\delta = 1$ corresponds to equation (4)).



It is an essential feature of the present invention that the enhanced PSD $\hat{\Phi}_s(\omega)$ is sampled
at a sufficient number of frequencies $\omega$ in order to obtain an accurate picture of the enhanced
PSD. In practice the PSD is calculated at a discrete set of frequencies,

$$\omega = \frac{2\pi m}{M}, \qquad m = 1, \ldots, M \tag{10}$$

see [3], which gives a discrete sequence of PSD estimates

$$\left\{\hat{\Phi}_s(1), \hat{\Phi}_s(2), \ldots, \hat{\Phi}_s(M)\right\} = \left\{\hat{\Phi}_s(m)\right\}, \qquad m = 1, \ldots, M \tag{11}$$

This feature is further illustrated by figures 4-6. Figure 4 illustrates a typical PSD estimate $\hat{\Phi}_x(\omega)$
of noisy speech. Figure 5 illustrates a typical PSD estimate $\hat{\Phi}_v(\omega)$ of background noise. In
this case the signal-to-noise ratio between the signals in figures 4 and 5 is 0 dB. Figure 6
illustrates the enhanced PSD estimate $\hat{\Phi}_s(\omega)$ after noise subtraction in accordance with
equation (9), where in this case $\delta = 1$. Since the shape of PSD estimate $\hat{\Phi}_s(\omega)$ is important
for the estimation of enhanced speech parameters (described below), it is an essential
feature of the present invention that the enhanced PSD estimate $\hat{\Phi}_s(\omega)$ is sampled at a
sufficient number of frequencies to give a true picture of the shape of the function (especially
of the peaks).

In practice $\hat{\Phi}_s(\omega)$ is sampled by using expressions (6) and (7). In, for example, expression
(7) $\hat{\Phi}_x(\omega)$ may be sampled by using the Fast Fourier Transform (FFT). Thus, 1, $a_1$, $a_2$, ...,
$a_p$ are considered as a sequence, the FFT of which is to be calculated. Since the number of
samples M must be larger than p (p is approximately 10-20) it may be necessary to zero pad
the sequence. Suitable values for M are powers of 2, for example 64, 128,
256. However, usually the number of samples M may be chosen smaller than the frame
length (N=256 in this example). Furthermore, since $\hat{\Phi}_s(\omega)$ represents the spectral density
of power, which is a non-negative entity, the sampled values of $\hat{\Phi}_s(\omega)$ have to be restricted
to non-negative values before the enhanced speech parameters are calculated from the sampled
enhanced PSD estimate $\hat{\Phi}_s(\omega)$.

After block 30 has performed the PSD subtraction the collection $\{\hat{\Phi}_s(m)\}$ of samples is
forwarded to a block 32 for calculating the enhanced speech parameters from the PSD-estimate
(step 190 in fig. 3). This operation is the reverse of blocks 20 and 26, which
calculated PSD-estimates from AR parameters. Since it is not possible to explicitly derive 
these parameters directly from the PSD estimate, iterative algorithms have to be used. A 
general algorithm for system identification, for example as proposed in [4], may be used. 

A preferred procedure for calculating the enhanced parameters is also described in the 
APPENDIX. 

The enhanced parameters may be used either directly, for example, in connection with speech 
encoding, or may be used for controlling a filter, such as Kalman filter 34 in the noise 
suppressor of figure 1 (step 200 in fig. 3). Kalman filter 34 is also controlled by the estimated 
noise AR parameters, and these two parameter sets control Kalman filter 34 for filtering 
frames {x(k)} containing noisy speech in accordance with the principles described in [1]. 

If only the enhanced speech parameters are required by an application it is not necessary to 
actually estimate noise AR parameters (in the noise suppressor of figure 1 they have to be 
estimated since they control Kalman filter 34). Instead the long-time stationarity of 
background noise may be used to estimate $\Phi_v(\omega)$. For example, it is possible to use

$$\bar{\Phi}_v(\omega)^{(m)} = \rho\, \bar{\Phi}_v(\omega)^{(m-1)} + (1 - \rho)\, \hat{\Phi}_v(\omega) \tag{12}$$

where $\bar{\Phi}_v(\omega)^{(m)}$ is the (running) averaged PSD estimate based on data up to and including
frame number m, and $\hat{\Phi}_v(\omega)$ is the estimate based on the current frame ($\hat{\Phi}_v(\omega)$ may be
estimated directly from the input data by a periodogram (FFT)). The scalar $\rho \in (0,1)$ is
tuned in relation to the assumed stationarity of v(k). An average over $\tau$ frames roughly
corresponds to $\rho$ implicitly given by
(13)



Parameter $\rho$ may for example have a value around 0.95.

In a preferred embodiment averaging in accordance with (12) is also performed for a 
parametric PSD estimate in accordance with (6). This averaging procedure may be a part of 
block 26 in fig. 1 and may be performed as a part of step 160 in fig. 3. 
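
The running average of the noise PSD can be sketched as a first-order recursion, assuming that (12) has the usual form in which the previous average is weighted by $\rho$ and the current frame estimate by $1-\rho$. The function name `update_noise_psd` and the initialization with the first noise frame are assumptions of this example.

```python
import numpy as np


def update_noise_psd(avg_psd, frame_psd, rho=0.95):
    """One step of first-order recursive averaging of the noise PSD.

    avg_psd is the running average up to the previous noise frame (or None
    for the first noise frame); frame_psd is the current frame's estimate.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    frame_psd = np.asarray(frame_psd, dtype=float)
    if avg_psd is None:
        # Assumed initialization: start from the first frame estimate.
        return frame_psd.copy()
    return rho * np.asarray(avg_psd, dtype=float) + (1.0 - rho) * frame_psd
```

The function would be called once per noise frame (state 20), with the returned array buffered for use during subsequent speech frames.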

In a modified version of the embodiment of fig. 1 attenuator 28 may be omitted. Instead 
Kalman filter 34 may be used as an attenuator of signal x(k). In this case the parameters of 
the background noise AR model are forwarded to both control inputs of Kalman filter 34, but 
with a lower variance parameter (corresponding to the desired attenuation) on the control 
input that receives enhanced speech parameters during speech frames. 

Furthermore, if the delay caused by the calculation of enhanced speech parameters is
considered too long, according to a modified embodiment of the present invention it is 
possible to use the enhanced speech parameters for a current speech frame for filtering the 
next speech frame (in this embodiment speech is considered stationary over two frames). In 
this modified embodiment enhanced speech parameters for a speech frame may be calculated 
simultaneously with the filtering of the frame with enhanced parameters of the previous 
speech frame. 

The basic algorithm of the method in accordance with the present invention may now be 
summarized as follows: 

In speech pauses do

   estimate the PSD $\hat{\Phi}_v(\omega)$ of the background noise for a set of M frequencies.
   Here any kind of PSD estimator may be used, for example parametric or
   non-parametric (periodogram) estimation. Using long-time averaging in accordance
   with (12) reduces the error variance of the PSD estimate.

For speech activity: in each frame do

   based on {x(k)} estimate the AR parameters $\{a_i\}$ and the residual error variance
   $\sigma_x^2$ of the noisy speech.

   based on these noisy speech parameters, calculate the PSD estimate $\hat{\Phi}_x(\omega)$ of
   the noisy speech for a set of M frequencies.

   based on $\hat{\Phi}_x(\omega)$ and $\hat{\Phi}_v(\omega)$, calculate an estimate of the speech PSD $\hat{\Phi}_s(\omega)$
   using (9). The scalar $\delta$ is a design variable approximately equal to 1.

   based on the enhanced PSD $\hat{\Phi}_s(\omega)$, calculate the enhanced AR parameters and
   the corresponding residual variance.

Most of the blocks in the apparatus of fig. 1 are preferably implemented as one or several 
micro/signal processor combinations (for example blocks 14, 18, 20, 22, 26, 30, 32 and 34).

In order to illustrate the performance of the method in accordance with the present invention, 
several simulation experiments were performed. In order to measure the improvement of the 
enhanced parameters over original parameters, the following measure was calculated for 200 
different simulations 



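$$V = \frac{1}{200} \sum_{m=1}^{200} \left[ \frac{\sum_{k=1}^{M} \left[\log(\hat{\Phi}(k)) - \log(\Phi_s(k))\right]^2}{\sum_{k=1}^{M} \left[\log(\Phi_s(k))\right]^2} \right]^{(m)} \tag{14}$$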



This measure (loss function) was calculated for both noisy and enhanced parameters, i.e.
$\hat{\Phi}(k)$ denotes either $\hat{\Phi}_x(k)$ or $\hat{\Phi}_s(k)$. In (14), $(\cdot)^{(m)}$ denotes the result of simulation
number m. The two measures are illustrated in figure 7. Figure 8 illustrates the ratio between
these measures. From the figures it may be seen that for low signal-to-noise ratios (SNR <
15 dB) the enhanced parameters outperform the noisy parameters, while for high signal-to-noise
ratios the performance is approximately the same for both parameter sets. At low SNR
values the improvement in SNR between enhanced and noisy parameters is of the order of
7 dB for a given value of measure V.

It will be understood by those skilled in the art that various modifications and changes may 
be made to the present invention without departure from the spirit and scope thereof, which 
is defined by the appended claims. 

APPENDIX

In order to obtain an increased numerical robustness of the estimation of enhanced 
parameters, the estimated enhanced PSD data in (11) are transformed in accordance with the 
following non-linear data transformation 

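$$\hat{\Gamma} = \left(\hat{\gamma}(1), \hat{\gamma}(2), \ldots, \hat{\gamma}(M)\right)^T \tag{15}$$

where

$$\hat{\gamma}(k) = \begin{cases} -\log(\hat{\Phi}_s(k)) & \hat{\Phi}_s(k) > \epsilon \\ -\log(\epsilon) & \hat{\Phi}_s(k) \le \epsilon \end{cases} \qquad k = 1, \ldots, M \tag{16}$$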



and where $\epsilon$ is a user chosen or data dependent threshold that ensures that $\hat{\gamma}(k)$ is real
valued. Using some rough approximations (based on a Fourier series expansion, an
assumption on a large number of samples, and high model orders) one has in the frequency
interval of interest

$$E\left[\hat{\Phi}_s(i) - \Phi_s(i)\right]\left[\hat{\Phi}_s(k) - \Phi_s(k)\right] \approx \begin{cases} \dfrac{2r}{N}\, \Phi_s^2(k) & k = i \\ 0 & k \ne i \end{cases} \tag{17}$$



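Equation (17) gives

$$E\left[\hat{\gamma}(i) - \gamma(i)\right]\left[\hat{\gamma}(k) - \gamma(k)\right] \approx \begin{cases} \dfrac{2r}{N} & k = i \\ 0 & k \ne i \end{cases} \tag{18}$$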



In (18) the expression $\gamma(k)$ is defined by

$$\gamma(k) = E\left[\hat{\gamma}(k)\right] = -\log(\sigma_s^2) + \log\left|1 + \sum_{m=1}^{r} c_m e^{-i\frac{2\pi k}{M} m}\right|^2 \tag{19}$$
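
Assuming that one has a statistically efficient estimate $\hat{\Gamma}$, and an estimate of the corresponding
covariance matrix $\hat{P}_\Gamma$, the vector

$$\chi = \left(\sigma_s^2, c_1, c_2, \ldots, c_r\right)^T \tag{20}$$

and its covariance matrix $P_\chi$ may be calculated in accordance with

$$\begin{aligned}
G(k) &= \left.\frac{\partial \Gamma^T(\chi)}{\partial \chi}\right|_{\chi = \hat{\chi}(k)} \\
P_\chi(k) &= \left[G(k)\, \hat{P}_\Gamma^{-1}\, G^T(k)\right]^{-1} \\
\hat{\chi}(k+1) &= \hat{\chi}(k) + P_\chi(k)\, G(k)\, \hat{P}_\Gamma^{-1} \left[\hat{\Gamma} - \Gamma(\hat{\chi}(k))\right]
\end{aligned} \tag{21}$$

with initial estimates $\hat{\Gamma}$, $\hat{P}_\Gamma$ and $\hat{\chi}(0)$.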



In the above algorithm the relation between $\Gamma(\chi)$ and $\chi$ is given by

$$\Gamma(\chi) = \left(\gamma(1), \gamma(2), \ldots, \gamma(M)\right)^T \tag{22}$$



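where $\gamma(k)$ is given by (19). With

$$\frac{\partial \gamma(k)}{\partial \chi} =
\begin{pmatrix}
\dfrac{\partial \gamma(k)}{\partial \sigma_s^2} \\
\dfrac{\partial \gamma(k)}{\partial c_1} \\
\dfrac{\partial \gamma(k)}{\partial c_2} \\
\vdots \\
\dfrac{\partial \gamma(k)}{\partial c_r}
\end{pmatrix}
=
\begin{pmatrix}
-\dfrac{1}{\sigma_s^2} \\
2\,\mathrm{Re}\left[\dfrac{e^{-i\frac{2\pi k}{M}}}{1 + \sum_{m=1}^{r} c_m e^{-i\frac{2\pi k}{M} m}}\right] \\
2\,\mathrm{Re}\left[\dfrac{e^{-i\frac{2\pi k}{M} 2}}{1 + \sum_{m=1}^{r} c_m e^{-i\frac{2\pi k}{M} m}}\right] \\
\vdots \\
2\,\mathrm{Re}\left[\dfrac{e^{-i\frac{2\pi k}{M} r}}{1 + \sum_{m=1}^{r} c_m e^{-i\frac{2\pi k}{M} m}}\right]
\end{pmatrix} \tag{23}$$

the gradient of $\Gamma(\chi)$ with respect to $\chi$ is given by

$$\frac{\partial \Gamma^T(\chi)}{\partial \chi} = \left(\frac{\partial \gamma(1)}{\partial \chi}, \frac{\partial \gamma(2)}{\partial \chi}, \ldots, \frac{\partial \gamma(M)}{\partial \chi}\right) \tag{24}$$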



The above algorithm (21) involves a lot of calculations for estimating $\hat{P}_\Gamma$. A major part of
these calculations originates from the multiplication with, and the inversion of, the (M x M)
matrix $\hat{P}_\Gamma$. However, $\hat{P}_\Gamma$ is close to diagonal (see equation (18)) and may be approximated
by

$$\hat{P}_\Gamma \approx \frac{2r}{N}\, I = \text{const} \cdot I \tag{25}$$



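where I denotes the (M x M) unity matrix. Thus, according to a preferred embodiment the
following sub-optimal algorithm may be used

$$\begin{aligned}
G(k) &= \left.\frac{\partial \Gamma^T(\chi)}{\partial \chi}\right|_{\chi = \hat{\chi}(k)} \\
\hat{\chi}(k+1) &= \hat{\chi}(k) + \left[G(k)\, G^T(k)\right]^{-1} G(k) \left[\hat{\Gamma} - \Gamma(\hat{\chi}(k))\right]
\end{aligned} \tag{26}$$

with initial estimates $\hat{\Gamma}$ and $\hat{\chi}(0)$. In (26), G(k) is of size ((r+1) x M).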
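
The sub-optimal iteration (26) may be sketched as follows, using the data transformation (16), the model (19) and the gradient (23). The function names, the default threshold `eps`, the fixed number of iterations and the absence of any step-size control or check that the variance estimate stays positive are all assumptions of this sketch; a practical implementation would need a suitable initial estimate and safeguards.

```python
import numpy as np


def log_transform(psd, eps=1e-6):
    """Data transformation (16): -log of the PSD samples, thresholded at eps."""
    return -np.log(np.maximum(np.asarray(psd, dtype=float), eps))


def model_and_gradient(chi, n_freqs):
    """Return Gamma(chi) of length M and G = dGamma^T/dchi of shape (r+1, M).

    chi = (sigma2, c_1, ..., c_r); frequencies are w_k = 2*pi*k/M, k = 1..M.
    """
    sigma2, c = chi[0], chi[1:]
    k = np.arange(1, n_freqs + 1)
    m = np.arange(1, len(c) + 1)
    z = np.exp(-2j * np.pi * np.outer(m, k) / n_freqs)   # shape (r, M)
    denom = 1.0 + c @ z                                  # 1 + sum_m c_m e^{-i w_k m}
    gamma = -np.log(sigma2) + np.log(np.abs(denom) ** 2)
    grad = np.vstack([np.full(n_freqs, -1.0 / sigma2),
                      2.0 * np.real(z / denom)])
    return gamma, grad


def fit_ar_to_psd(psd, chi0, n_iter=20, eps=1e-6):
    """Iterate chi <- chi + (G G^T)^{-1} G (Gamma_hat - Gamma(chi)), as in (26)."""
    gamma_hat = log_transform(psd, eps)
    chi = np.asarray(chi0, dtype=float).copy()
    for _ in range(n_iter):
        gamma, grad = model_and_gradient(chi, len(gamma_hat))
        chi = chi + np.linalg.solve(grad @ grad.T, grad @ (gamma_hat - gamma))
    return chi
```

Because the covariance approximation (25) is a multiple of the unity matrix, it cancels in the update, so only the small ((r+1) x (r+1)) system has to be solved in each iteration.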



CLAIMS

1. A noisy speech parameter enhancement method, characterized by 

determining a background noise power spectral density estimate at M frequencies,
where M is a predetermined positive integer, from a first collection of background noise
samples;

estimating p autoregressive parameters, where p is a predetermined positive integer 
significantly smaller than M, and a first residual variance from a second collection of noisy 
speech samples; 

determining a noisy speech power spectral density estimate at said M frequencies from
said p autoregressive parameters and said first residual variance;

determining an enhanced speech power spectral density estimate by subtracting said 
background noise spectral density estimate multiplied by a predetermined positive factor from 
said noisy speech power spectral density estimate; and 

determining r enhanced autoregressive parameters, where r is a predetermined positive
integer, and an enhanced residual variance from said enhanced speech power spectral density.

2. The method of claim 1, characterized by restricting said enhanced speech power spectral 
density estimate to non-negative values. 

3. The method of claim 2, characterized by said predetermined positive factor having a value 
in the range 0-4. 

4. The method of claim 3, characterized by said predetermined positive factor being
approximately equal to 1.

5. The method of claim 4, characterized by said predetermined integer r being equal to said 
predetermined integer p. 




6. The method of claim 5, characterized by 

estimating q autoregressive parameters, where q is a predetermined positive integer
smaller than p, and a second residual variance from said first collection of background noise
samples; 

determining said background noise power spectral density estimate at said M
frequencies from said q autoregressive parameters and said second residual variance.

7. The method of claim 1 or 6, characterized by averaging said background noise power 
spectral density estimate over a predetermined number of collections of background noise 
samples. 

8. The method of any of the preceding claims, characterized by using said enhanced
autoregressive parameters and said enhanced residual variance for adjusting a filter for
filtering a third collection of noisy speech samples.

9. The method of claim 8, characterized by said second and said third collection of noisy 
speech samples being the same collection. 

10. The method of claim 8 or 9, characterized by Kalman filtering said third collection of
noisy speech samples.

11. A noisy speech parameter enhancement apparatus, characterized by 

means (22, 26) for determining a background noise power spectral density estimate
at M frequencies, where M is a predetermined positive integer, from a first collection of
background noise samples;

means (18) for estimating p autoregressive parameters, where p is a predetermined 
positive integer significantly smaller than M, and a first residual variance from a second 
collection of noisy speech samples; 

means (20) for determining a noisy speech power spectral density estimate at said M
frequencies from said p autoregressive parameters and said first residual variance;

means (30) for determining an enhanced speech power spectral density estimate by
subtracting said background noise spectral density estimate multiplied by a predetermined 
positive factor from said noisy speech power spectral density estimate; and 

means (32) for determining r enhanced autoregressive parameters, where r is a
predetermined positive integer, and an enhanced residual variance from said enhanced speech
power spectral density estimate.

12. The apparatus of claim 11, characterized by means (30) for restricting said enhanced
speech power spectral density estimate to non-negative values. 

13. The apparatus of claim 12, characterized by 

means (22) for estimating q autoregressive parameters, where q is a predetermined
positive integer smaller than p, and a second residual variance from said first collection of
background noise samples;

means (26) for determining said background noise power spectral density estimate at 
said M frequencies from said q autoregressive parameters and said second residual variance. 

14. The apparatus of claim 11 or 13, characterized by means (26) for averaging said
background noise power spectral density estimate over a predetermined number of collections
of background noise samples.

15. The apparatus of any of the preceding claims, characterized by means (34) for using said
enhanced autoregressive parameters and said enhanced residual variance for adjusting a filter
for filtering a third collection of noisy speech samples.

16. The apparatus of claim 15, characterized by a Kalman filter (34) for filtering said third 
collection of noisy speech samples. 

17. The apparatus of claim 15, characterized by a Kalman filter (34) for filtering said third
collection of noisy speech samples, said second and said third collection of noisy speech
samples being the same collection.